The construction and use of Simulation Test Data (STD) 
to help evaluate alternative chemometric methodologies is a 
highly welcome contribution to the field. Dr. Currie, and the 
agencies and colleagues whom he credits, are to be congrat- 
ulated for an approach which has the potential to promote 
improvements in the art of quantitative chemical analysis. 

What follows is a brief discussion of some previous use 
of standard data sets in statistical research, along with some 
warnings about the possible pitfalls connected with the use 
of such approaches. In particular, the parallel that Dr. Currie 
draws between the use of standard data sets and interlabora- 
tory comparisons using common reference materials cannot 
be pushed too far. Many interacting factors lead to bias in 
modeling and analysis of complex data sets; the contribu- 
tions of these factors would be confounded in typical inter- 
laboratory comparison designs. One factor, scientific judgment,
cannot even be identified in standard frequentist 
reports of statistical data analysis. This suggests that subjec- 
tive scientific judgments need to be given more explicit 
mention in reports of statistical analyses, perhaps through 
the use of the Bayesian approach to inference. 

To make standard data sets more closely resemble real- 
world data, the use of the "bootstrap" is suggested. The 
"bootstrap" can also help in providing the estimates of statis- 
tical precision that Dr. Currie notes were lacking in the two 
studies conducted to date. 



Standard Data Sets in Statistics 

Statisticians have long recognized the usefulness of hav- 
ing common data sets on which new methodologies can be 
tried out, and their relative merits assessed. For example, 
new methodologies for classification and discriminant anal- 
ysis are often applied to the iris data of E. Anderson [1]¹, 
which was featured in a famous paper by R. A. Fisher [2]. 
(To Anderson's undoubted frustration, these data are usu- 
ally referred to as "Fisher's iris data.") 

Another famous data set is Longley's [3] econometric 
linear regression data. In these data, the independent variables 
in the regression are highly interrelated (multicollinear). 
Longley ran the data through several computer software 
packages designed to do least squares analysis. In 
theory, all of these programs solve the same set of linear 
equations to estimate the regression slopes. However, the 
solutions obtained by the various algorithms differed, in 
some cases even by sign! What had happened was that the 
multicollinearity in the data made the answers obtained 
highly sensitive to roundoff and truncation of the data, and 
the algorithms differed by where and by how much roundoffs 
were done. Longley's paper had the very beneficial conse- 
quence that software developers now pay careful attention to 
numerical analysis in designing statistical algorithms. Fur- 
ther, it stimulated study of the resistance of statistical 
methodology to data perturbations (robustness). 
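The sensitivity described above can be illustrated with a small sketch. It does not use Longley's data; the two nearly identical predictors, the coefficients, and the noise level below are invented stand-ins, and rounding the design matrix to three decimals is only one assumed way of mimicking roundoff and truncation of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Hypothetical predictors: x2 differs from x1 only at about the third decimal place.
x1 = np.linspace(1.0, 2.0, n)
x2 = x1 + 1e-3 * rng.standard_normal(n)
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([1.0, 2.0, -1.0]) + 0.01 * rng.standard_normal(n)


def least_squares(design, response):
    """Least squares coefficients from NumPy's SVD-based solver."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


# A large condition number signals that small data changes can move the solution a lot.
condition_number = np.linalg.cond(X)

coef_full = least_squares(X, y)
coef_rounded = least_squares(np.round(X, 3), y)  # same data, rounded to 3 decimals
max_shift = float(np.max(np.abs(coef_full - coef_rounded)))

print(f"condition number of X: {condition_number:.0f}")
print("coefficients, full-precision data:   ", np.round(coef_full, 3))
print("coefficients, data rounded to 3 d.p.:", np.round(coef_rounded, 3))
print(f"largest coefficient shift: {max_shift:.3f}")
```

The same numerically stable solver is used for both fits, so any shift in the coefficients is due to the perturbation of the data rather than to the algorithm.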

However, Longley's paper (and particularly his data) may 
also have had a less salutary effect on software development. 
Software developers now know that consumers will 
test out their programs on Longley's data. [See, for example, 
Lachenbruch's review [4] of STAN, Version II, by 
David Allen.] This may lead them to overcompensate for 
multicollinearity problems, and consequently overlook or 
neglect other potential problems or sacrifice desirable fea- 
tures to include subroutines necessary to accurately process 
multicollinear data. 

This last comment points out a real danger in the use of 
standard data sets, namely that their existence can bias the 
direction which development of methodology and software 
takes. The best guard against such bias is the creation of 
standard data sets of many types. 

An artificial standard data set (simulated according to a 
known model for the distribution of errors) can lead to a 
particularly serious bias. Chemometricians who know that 
their work will be evaluated by such data sets will tend to 
use a methodology which is known to be efficient for the 
given statistical model. Such a methodology, however, may 
not do well against real data, for which the given statistical 
model is not necessarily a good approximation. Alterna- 
tively, chemometricians may object to evaluations on the 
basis of such data, arguing (with considerable merit) that 
such data do not reflect their practical experience. 

The reason, of course, for using artificial data is that the 
"truth" or "signal" underlying the "noise" (error) in the data 
is known. This allows us to separate bias (lack of validity) 
from precision (reliability, repeatability). In this respect 
there is an obvious parallel, which Dr. Currie correctly 
points out, with the use of common reference materials in 
interlaboratory comparisons. The goal of such studies is to 
eliminate bias (which is usually reflected in interlaboratory 
variation), and to estimate precision (intralaboratory varia- 
tion). However, whereas common reference materials are 
"real" (although they may be ideal examples of materials 
analyzed in practice), this is not clearly the case with data 
simulated from specified statistical populations (e.g., Gaus- 
sian populations). Real populations may have "heavy tails" 
and/or other irregular features (e.g., several modes) which are 
not modeled by standard distributions. 

One obvious solution is to vary the distributional assump- 
tions which generate the errors in artificial data. This ap- 
proach is widely used in statistics to study the robustness 
properties of statistical methodologies. 
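A robustness study of this kind might be sketched as follows. The three error models, the sample size, and the choice of the mean and median as competing estimators are all assumptions made for illustration; they are not taken from Dr. Currie's studies.

```python
import numpy as np


def gaussian(rng, size):
    return rng.standard_normal(size)


def student_t3(rng, size):
    # Heavy-tailed errors: Student's t with 3 degrees of freedom.
    return rng.standard_t(3, size)


def contaminated(rng, size):
    # Mostly unit-scale Gaussian errors, with about 10% drawn at ten times the scale.
    scale = np.where(rng.random(size) < 0.1, 10.0, 1.0)
    return scale * rng.standard_normal(size)


def simulate_rmse(draw_errors, estimator, truth=10.0, n=25, reps=2000, seed=1):
    """Root-mean-square error of an estimator of a known 'truth' over simulated data sets."""
    rng = np.random.default_rng(seed)
    data = truth + draw_errors(rng, (reps, n))
    estimates = estimator(data, axis=1)
    return float(np.sqrt(np.mean((estimates - truth) ** 2)))


error_models = {"gaussian": gaussian, "student_t3": student_t3, "contaminated": contaminated}
results = {
    name: {"mean": simulate_rmse(draw, np.mean), "median": simulate_rmse(draw, np.median)}
    for name, draw in error_models.items()
}

for name, rmse in results.items():
    print(f"{name:>12}: mean RMSE = {rmse['mean']:.3f}, median RMSE = {rmse['median']:.3f}")
```

Because the truth is known, the ranking of the two estimators can be compared directly across error models; a method that is efficient under the Gaussian model need not remain so when the tails are heavy.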

Another possible solution is to use the ideas underlying 
the "bootstrap" (Efron [5], Diaconis and Efron [6], Freed- 
man and Peters [7,8]) to simulate data which have "real 
world" error distributions. 



The Bootstrap 

In using the "bootstrap," we start by assuming that the 
observed data $y_i$ are related to unknown parameters $\theta$ and 
errors $e_i$ by a model 

$$y_i = G(\theta, e_i), \qquad i = 1, 2, \ldots, n, \tag{1}$$

where $G(\cdot, \cdot)$ is known. Given a value for $\theta$ (which may be 
a vector), we assume that eq (1) can be inverted to obtain 
the errors $e_i$. That is, 

$$e_i = H_i(\theta, y_1, \ldots, y_n), \qquad i = 1, \ldots, n. \tag{2}$$

¹Figures in brackets indicate literature references. 



Given "real" data $y_1, y_2, \ldots, y_n$, with $n$ sufficiently large to 
give us some hope of accurately estimating $\theta$ by 

$$\hat{\theta} = \hat{\theta}(y_1, \ldots, y_n),$$

we now construct the residuals (estimated errors) 

$$\hat{e}_i = H_i(\hat{\theta}, y_1, \ldots, y_n), \qquad i = 1, 2, \ldots, n. \tag{3}$$

The resulting finite population $\{\hat{e}_1, \ldots, \hat{e}_n\}$ of residuals is the 
statistical population from which we can randomly sample 
new errors $\tilde{e}_i$, $i = 1, 2, \ldots, m$, to create standard data sets 

$$y_i = G(\theta^*, \tilde{e}_i), \qquad i = 1, 2, \ldots, m,$$

where $\theta^*$ can be chosen to have any desired value. 
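
The construction above can be sketched for one simple case. The additive straight-line model used for G, the least squares estimate of the parameters, and the "real" data (which are themselves simulated stand-ins here) are all assumptions made for illustration; in practice G would come from chemical and physical theory and the residuals from actual measurements.

```python
import numpy as np


def G(theta, x, e):
    """Assumed model (1): a straight line in x with additive error."""
    return theta[0] + theta[1] * x + e


def H(theta, x, y):
    """Inverse (2) of the assumed model: recover the errors from the observations."""
    return y - theta[0] - theta[1] * x


def estimate_theta(x, y):
    """Least squares estimate of the intercept and slope."""
    design = np.column_stack([np.ones_like(x), x])
    theta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    return theta_hat


def make_standard_data_set(x_real, y_real, x_new, theta_star, rng):
    """Resample residuals from the real data and attach them to a chosen truth theta_star."""
    theta_hat = estimate_theta(x_real, y_real)
    residuals = H(theta_hat, x_real, y_real)            # eq (3)
    e_new = rng.choice(residuals, size=len(x_new), replace=True)
    return G(theta_star, x_new, e_new), e_new, residuals


rng = np.random.default_rng(0)

# Stand-in for real measurements (hypothetical): heavy-tailed errors about a line.
x_real = np.linspace(0.0, 10.0, 60)
y_real = 2.0 + 0.5 * x_real + 0.3 * rng.standard_t(3, x_real.size)

theta_star = np.array([0.0, 1.0])                       # the known "truth" of the new data set
x_new = np.linspace(0.0, 10.0, 20)
y_star, e_star, residuals = make_standard_data_set(x_real, y_real, x_new, theta_star, rng)
print(np.round(y_star[:5], 3))
```

The simulated errors are drawn only from the observed residuals, so no distributional form is imposed on them, while the truth underlying the new data set remains known exactly.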

The data sets so simulated are not entirely "real." The 
model (1) relating observations $y_i$ to errors $e_i$ must still be 
specified, and need not be correct. However, such models 
can be specified (and criticized) taking account of chemical 
and physical theory, without also imposing statistical as- 
sumptions about distributions of errors. As such, these mod- 
els are "prestatistical." 

A population of residuals $\hat{e}_i$ that is too small can misrepresent 
statistical variation. Thus, attempts should be made to 
constantly enlarge this population with new residuals ob- 
tained from real data obtained in contexts described by the 
model (1). Since the "bootstrap" is a fairly recent statistical 
development, new insights into problems and advantages 
connected with the method are constantly being published. 
Consequently, the input of specialists in "bootstrap" 
methodology should be sought when applying this method 
to the generation of standard data sets. In particular, changes 
in instrumentation, personnel, or experimental design may 
change the error population over time. Careful attention 
should be paid to detect such shifts in distribution. 

Not all measurement contexts lend themselves to the 
"bootstrap," since the transformation (2) from observations 
to errors may not exist, or may not be well defined. (This 
may be the case, for example, with the Gamma Ray Spec- 
trum Analysis example discussed by Dr. Currie.) However, 
when the "bootstrap" does apply, it can be used both to 
create standard data sets, and also to provide nonparametric 
estimates of precision [5,7,8]. 
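
As a minimal sketch of the second use, the following resamples residuals to estimate the standard errors of a fitted intercept and slope. The straight-line model, the simulated data, and the number of bootstrap replicates are assumptions for illustration only.

```python
import numpy as np


def fit_line(x, y):
    """Least squares intercept and slope."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def bootstrap_standard_errors(x, y, n_boot=2000, seed=0):
    """Residual-bootstrap standard errors of the intercept and slope."""
    rng = np.random.default_rng(seed)
    coef_hat = fit_line(x, y)
    fitted = coef_hat[0] + coef_hat[1] * x
    residuals = y - fitted
    replicates = np.empty((n_boot, 2))
    for b in range(n_boot):
        # Rebuild a data set from the fitted line plus resampled residuals, then refit.
        y_boot = fitted + rng.choice(residuals, size=y.size, replace=True)
        replicates[b] = fit_line(x, y_boot)
    return coef_hat, replicates.std(axis=0, ddof=1)


rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 40)
y = 1.0 + 0.5 * x + 0.3 * rng.standard_t(3, x.size)     # hypothetical heavy-tailed data

coef_hat, boot_se = bootstrap_standard_errors(x, y)

# Classical least squares standard error of the slope, for comparison.
resid = y - (coef_hat[0] + coef_hat[1] * x)
classical_slope_se = float(np.sqrt(resid @ resid / (x.size - 2) / np.sum((x - x.mean()) ** 2)))

print(f"slope = {coef_hat[1]:.3f}")
print(f"bootstrap SE = {boot_se[1]:.4f}, classical SE = {classical_slope_se:.4f}")
```

For a straight line the two standard errors should be close; the appeal of the bootstrap is that the same resampling loop applies unchanged when the estimator has no convenient formula for its precision.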



Estimates of Precision 

Although I share Dr. Currie's concern that the laborato- 
ries in his two examples either failed to provide estimates of 
precision, or gave incorrect estimates, I must point out that 
in Dr. Currie's two examples, it is not clear what measures 
of precision are appropriate. In all of the analyses, multiple 
decisions are made. For example, in the Gamma-Ray exam- 
ples, the locations and amplitudes of several peaks had to be 
determined simultaneously. Although individual standard 
errors can be given, these do not directly provide measures 
of simultaneous accuracy [9]. Further, in the detection spectrum 
and precision spectra sets, even the number of peaks 
was unknown. This produces a highly complicated estimation 
problem for which only large-sample approximations to 
precision are available. There is some evidence in the literature 
that such large-sample approximations have considerable 
bias in moderate samples (see, e.g., [6]). 

Similar problems arise in the three data sets for the NBS- 
EPA Source Apportionment study, particularly in the case 
of Data Set I, where the number of sources is left unspeci- 
fied. Both ridge regression and factor analysis are ex- 
ploratory methodologies, requiring iteration and judgment 
that are difficult to describe analytically. The only available 
measures of precision for such techniques are large-sample 
approximations which refer to analytical formulas for the 
estimators not directly related to the way such estimators are 
actually obtained. For example, I know of no way to specify 
the precision of estimates of slope obtained by the ridge 
trace method. Published formulas for the precisions of ridge 
regression estimators refer to those estimators in which the 
ridge factor $k$ is a specified function of the data, rather than 
being obtained by inspection of the ridge trace. 
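
The distinction can be made concrete with a short sketch. It computes ridge estimates over a grid of ridge factors (the numbers one would plot and inspect as a ridge trace) and, separately, a ridge factor given by a fixed formula in the data. The simulated predictors, the grid, and the particular rule k = p s² / (b'b) are illustrative assumptions, not the procedure used by any laboratory in the studies discussed.

```python
import numpy as np


def standardize(X):
    """Center each column and scale it to unit length."""
    Xc = X - X.mean(axis=0)
    return Xc / np.sqrt((Xc ** 2).sum(axis=0))


def ridge(Z, yc, k):
    """Ridge estimator (Z'Z + kI)^{-1} Z'y for standardized Z and centered y."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ yc)


def ridge_trace(Z, yc, ks):
    """Coefficient vectors for each ridge factor in ks, one row per k."""
    return np.array([ridge(Z, yc, k) for k in ks])


def data_based_k(Z, yc):
    """An assumed example of 'k as a specified function of the data': p * s^2 / (b'b)."""
    n, p = Z.shape
    b_ols = ridge(Z, yc, 0.0)
    s2 = np.sum((yc - Z @ b_ols) ** 2) / (n - p - 1)
    return float(p * s2 / (b_ols @ b_ols))


rng = np.random.default_rng(2)
n = 30
x1 = rng.standard_normal(n)
x2 = x1 + 0.05 * rng.standard_normal(n)                  # nearly collinear with x1
x3 = rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + x1 + x2 + 0.5 * x3 + 0.5 * rng.standard_normal(n)

Z, yc = standardize(X), y - y.mean()
ks = np.array([0.0, 0.01, 0.05, 0.1, 0.5])
trace = ridge_trace(Z, yc, ks)
norms = np.linalg.norm(trace, axis=1)
k_hat = data_based_k(Z, yc)

for k, b in zip(ks, trace):
    print(f"k = {k:4.2f}: {np.round(b, 3)}")
print(f"data-based k = {k_hat:.4f}")
```

A value such as `k_hat` is reproducible from the data alone, so its sampling behavior can in principle be studied; a value chosen by eye from the printed trace depends on the analyst's judgment, which is the source of the difficulty noted above.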

Given the complex natures of the estimation problems 
that Dr. Currie describes, and the fact that statistical theory 
has not yet provided reasonable estimates of precision for 
some of the methodologies used in these problems, it is not 
surprising that the laboratories either failed to provide mea- 
sures of precision, or gave estimates that were off the mark. 
Clearly, there is much theoretical statistical work yet to be 
done. 

In the meantime, it should be mentioned again that the 
"bootstrap" can provide estimates of precision in cases where 
the assumptions (1), (2) underlying the bootstrap are applicable. 

Analogy to CMP Interlaboratory Comparisons 

As already noted, Dr. Currie makes an analogy between 
the use of standard data sets in the two examples he dis- 
cusses, and traditional interlaboratory comparisons using 
common reference materials. However, this analogy cannot 
be carried too far, since there are some important differences 
in context. 

In traditional interlaboratory comparisons, differences 
between laboratories are usually assumed to be due to vari- 
ations in the calibration (adjustment) of the instruments, or 
to differences in instrumentation or technique. Consequently, 
a one-factor (additive) components-of-variance 
ANOVA model can reasonably be employed to apportion variability 
between inter- and intralaboratory sources. 
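
For a balanced design, the one-factor calculation can be sketched as follows. The three laboratories and their duplicate results are invented numbers used only to show the arithmetic, and the method-of-moments estimators (with the between-laboratory component truncated at zero) are one common choice assumed here.

```python
import numpy as np


def variance_components(results):
    """One-factor random-effects ANOVA for a balanced layout.

    results: 2-D array with one row per laboratory and one column per replicate.
    Returns (between-laboratory variance, within-laboratory variance).
    """
    results = np.asarray(results, dtype=float)
    labs, reps = results.shape
    lab_means = results.mean(axis=1)
    grand_mean = results.mean()

    ms_between = reps * np.sum((lab_means - grand_mean) ** 2) / (labs - 1)
    ms_within = np.sum((results - lab_means[:, None]) ** 2) / (labs * (reps - 1))

    var_within = ms_within
    var_between = max((ms_between - ms_within) / reps, 0.0)  # truncate negative estimates
    return float(var_between), float(var_within)


# Hypothetical duplicate measurements of one reference material by three laboratories.
lab_results = [
    [10.1, 9.9],
    [10.6, 10.4],
    [9.6, 9.4],
]

var_between, var_within = variance_components(lab_results)
print(f"between-laboratory variance: {var_between:.3f}")
print(f"within-laboratory variance:  {var_within:.3f}")
```

The calculation has a single classification factor, the laboratory; it has no means of separating several interacting sources of between-laboratory variation.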

In the standard data set context described by Dr. Currie, 
however, there are at least three factors which can describe 
variability between laboratories: 1) different models or as- 
sumptions used, 2) different statistical methodologies em- 
ployed, and 3) different numerical algorithms. Further, the 
"levels" of these factors (particularly factor 1) appropriate to 
describe a given laboratory's analysis are not always apparent. 
(Not all assumptions made are clearly stated.) For example, 
"outliers" can be discarded, parameter values can be 
truncated (e.g., negative estimated amplitudes reported as 
zero), or several different analyses may be run but only one 
(the one that the laboratory thinks is "right") reported. Con- 
sequently, it will be difficult to separate sources of interlab- 
oratory variation. 

Even worse, even if all factor levels can be accurately 
identified (or set in advance), there is the clear possibility of 
interaction among factors, making interpretation of results 
difficult. For example, different methodologies work "best" 
in the context of different models, and different models and 
methodologies lead to different algorithms and thus to dif- 
ferent reasons for numerical instability when such al- 
gorithms are applied to data. 

How do we then interpret Dr. Currie's examples? First, 
and most important, we see that even in contexts far more 
precisely defined (in terms of model) than would be met in 
practice, there are wide variations in conclusions between 
laboratories, and also wide variations from the correct answer. 
Second, the divergences between laboratories cannot 
be assigned to sampling variation because (and this is the 
beauty of the standard data sets) the data are fixed. How- 
ever, the divergence of the centroid of the laboratory conclu- 
sions from the truth may be due to sampling variation (the 
sample did not represent the population), or to poor labora- 
tory conclusion-making processes, or to both. We cannot 
partition this last variability in terms of possible causes, 
because no accurate measures of precision over sampling 
variation are provided by the laboratories, or (in some cases) 
known. 

To better understand the sources of interlaboratory varia- 
tion, we need to start with pilot studies that control the levels 
of each of the factors (1-3) listed above. To establish the 
contribution of algorithms to interlaboratory variability, we 
need to ask numerical analysts to study the possible numer- 
ical errors that can occur in algorithms, describe the situa- 
tions that produce these errors, and suggest remedies to 
reduce such errors. (Here, our "pilot sample" design fixes 
all factors but the "algorithm" factor.) To establish the contribution 
of methodology (or rather methodology-model 
interactions) to such variability, chemometricians (particularly 
statisticians) need to use mathematical analysis and 
simulation to identify formulas for the precisions (sampling 
variability) that can be assigned to the various methodologies 
in the contexts of various models. (Here, we assume a 
fixed, perfectly accurate algorithm, and vary combinations 
of method and model.) Finally, we need to study and assess 
the variability due to choice of model, and also to the other 
"scientific judgments" made by a laboratory in choosing 
methodology and algorithms and in announcing measures of 
precision. It is particularly in this last type of study that 
standard data sets, both real and artificial, can be most 
useful. 



Scientific Judgment and Bayesian Inference 

The question of how to analyze the biases introduced by 
"scientific judgment" has a direct relationship to a long- 
standing controversy between classical (frequentist) statisticians 
and those statisticians who advocate a Bayesian approach. 
Scientific decision-making involves subjective 
judgments about both models and types of permissible con- 
clusions. When such judgments are unstated, we have seen 
that this can obscure our understanding of how decisions are 
reached, and thus prevent us from finding sources of "bias" 
or error. 

Bayesian statisticians, who try to mathematically model 
their subjective judgments in terms of prior probabilities 
over unknown parameters (and models), are often accused 
by frequentist statisticians of proposing analyses that lack 
"scientific objectivity." Clearly the contrary is true. The 
scientist who claims to base conclusions only on the 
"objective" evidence provided by observed frequencies is 
nevertheless often guilty of imposing unstated judgments on 
such evidence. The Bayesian, at least, tries to bring these 
judgments into the open, where they can be assessed along 
with the data. Even if we doubt that probability models can 
ever serve as adequate models of subjective belief, we can 
still applaud the Bayesian's efforts to expose the methods by 
which this belief interacts with the evidence in the data to 
produce new judgments or belief. Rather than criticize the 
Bayesian for being "subjective" or "biased," the frequentists 
need to find ways of making their own decision-making 
processes available for objective study, so that we can gain 
the opportunity to learn how to improve scientific judgment. 